load(here("Data/sim_H0_ksize.RData")) # filter size

load(here("Data/wt_exp_mat.RData"))
load(here("Data/sim_H0_expwt.RData")) # exponential weight 

load(here("Data/wt_type.RData"))
load(here("Data/sim_H0_wt_type.RData")) # shape of weight
# df_wt_type$wt_type <- factor(df_wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))

# load(here("Data/sim_H0_stride.RData"))
N <- 100 # 100 subjects for block bootstrap
M <- 1000
N_vec <- c(100, 200, 500, 1000) # sample size used to test simple linear regression

# block size in block bootstrap
max_size <- 9

# image size 
nS <- 32

# varying parameters used for data generation
ksize_vec <- unique(df_ksize$ksize)
alpha_vec <- unique(df_expwt$alpha)
wt_type_vec <- unique(df_wt_type$wt_type)

1 Generation framework

The simulation, technically, would be established on a continuous space-time domain, even though the final “observations” would be a discrete realization.

Take a Gaussian process for example:

  1. The observation is composed of a true latent process and an error process.

\[Y_i(\mathbf{s}, t) = \eta_i(\mathbf{s}, t) +\epsilon_i(\mathbf{s}, t)\] 2. The true latent process is composed of a fixed process and a random (subject-specific) process.

\[\eta_i(\mathbf{s}, t) = \mu(\mathbf{s}, t)+b_i(\mathbf{s}, t)\]

  • \(\mu(\mathbf{s}, t)\) is the population mean function, shared across subjects
  • \(b_i(\mathbf{s}, t)\) is the individual-level random effect
  1. The error process is spatially-correlated. Correlation is introduced through a moving average random field:

\[\epsilon_i(\mathbf{s}, t) = \frac{1}{N_r}\sum_{\mathbf{s'} \in S_r}Z(\mathbf{s'}, t)\]

where:

  • \(S_r\) is a neighborhood around \(\mathbf{s}\) where the radius is r
  • \(N_r\) is the number of spacial points within neighborhood \(S_r\)
  • \(Z(\mathbf{s'}, t)\) is a white noise process

1.1 Goals

Generate 2D/3D matrices that

  • Resemble actual WMF and QSM measure from MS lesion
  • Two correlated outcomes
  • Within each outcome, spatial correlation varies across individual. For inference purposes, we also hope to control the spatial correlation.

1.2 Scaling

The moving average filter from different area size or use different weights may lead to different variation in the generated image. I hererafter call the variance of generated pixel “baseline variance”.

First, if we look at the baseline variation a single generated MWA pixel, we get:

\[Var(X)=Var(\frac{\sum_{i,j}w_{ij}{z_{ij}}}{n})=\frac{\sum_{i,j}w_{ij}^2Var(z_{ij})}{n^2} = \frac{\sum_{i,j}w_{ij}^2}{n^2} \]

This for sure would change with the weight. To make a fair comparison, let’s try if we could scale the weights so that the generated data has the same baseline variation.

Let’s try a new weight \(w_{ij}'=w_{ij}\frac{n}{\sqrt{\sum w_{ij}^2}}\), \(n\) is the number of pixels in a filter.

1.3 Effect of filter size on spatial correlation

In this section, we would like to see if the filter size of moving average would affect the spatial correlation. Here we generate 2D image of 32 by 32. Note that filter size should be odd numbers, a convention inherited from image analysis. Thought even size filter is technically possible, it is much more convenienve to use odd size for many operations since it would have a center pixel

df_ksize %>%
  filter(id==1) %>% 
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill=Y1))+
  facet_wrap(~ksize, nrow = 1)

Below I plot the simple variance and variogram function at different kernel sizes. Variogram is a measure of spatial dependence as a function of distance. Smaller value indicates greater dependence.

df_ksize %>% 
  mutate(ksize = as.factor(ksize)) %>%
  group_by(id, ksize) %>%
  summarise(var = var(Y1), .groups = "keep") %>%
  ggplot(aes(x=ksize, y=var, group = ksize))+
  geom_boxplot()+
  geom_jitter(size = 0.2)+
  geom_hline(yintercept = 1, col = "red")+
  labs(title = "Individual variance")

# df_ksize %>%
#   mutate(ksize = as.factor(ksize)) %>%
#   group_by(ksize) %>%
#   summarise(var = var(Y1), .groups = "keep")

Its interesting to see that individual variance actually dropped with larger filter size (greater spatial independence), even after scaling. If calculate variace over all subjects, they are all very close to 1. Spatial dependence probably got canceled out across many subjects.

df_ksize %>%
  filter(id <= 5) %>%
  mutate(ksize = as.factor(ksize), subj_id = id) %>%
  group_by(subj_id, ksize) %>%
  group_modify(~{variogram(Y1~1, location = ~s1+s2, data = .x)}) %>%
  ggplot()+
  geom_line(aes(x=dist, y=gamma, col=ksize))+
  geom_point(aes(x=dist, y=gamma, col=ksize))+
  geom_hline(yintercept = 1, linetype = "dashed")+
  facet_wrap(~subj_id, nrow = 1)+
  labs(y="Sample variagram")

Some observations:

  • There is a HUGE difference between white noise and moveing average from even the smallest filter.
  • Larger filter size induces stronger spatial correlation.
  • We have the elbow in most cases (when the filter is not too large), but the decrease after the elbow is a lot slower compared to the lesion data. We may need a weight matrix to make it decrease shaper.

1.4 Effect of weight in exponential weight form

Here, I would like to use weighted average within a filter. Let’s fix the kernel size to be 5, and explore effect of difference weight matrix.

I would like to construct a weight matrix that decay fast from center to border. Let’s try using inverse of the exponential of the square of Euclidean distance to center. The weights are scaled such that they are sum to one within the filter. The scaling keep the overall variation of the generated outcome comparable to the other case.

\[w_{ij} \propto exp(-\alpha d_{ij}^2)\]

  • \(\alpha\) is a positive integer controling for the rate of change of weight wrt distance.
  • \(d_{ij}\) is the Euclidean distance between point (i, j) and center of filter.
df_wt <- expand.grid(s2=1:5, s1=1:5)

for(m in seq_along(alpha_vec)){
  df_wt[, paste0("alpha", alpha_vec[m])] <- as.vector(wt_exp[,,m])
}

df_wt %>%
  pivot_longer(starts_with("alpha")) %>% 
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill=value))+
  facet_wrap(~name, nrow = 1)+
  labs(title = "Weight matrix")

Here, when \(\alpha=5\), the boarder weight can get very, very close to zero (corner pixel: 4.14e-18). The center pixel takes 97.4% of the weight. When \(\alpha=0\), the weight matrix gets us a simple average. As \(\alpha\) increase, the filter put more weight on the center pixel and less weight on the boarder pixels, leading to lower spatial dependence.

df_expwt %>% 
  filter(id==1) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill=Y1))+
  facet_wrap(~alpha, nrow = 1)+
  labs(title = "Generated data")

df_expwt %>% 
  mutate(alpha = as.factor(alpha)) %>%
  group_by(id, alpha) %>%
  summarise(var = var(Y1), .groups = "keep") %>%
  ggplot(aes(x=alpha, y=var, group = alpha))+
  geom_boxplot()+
  geom_jitter(size = 0.2)+
  geom_hline(yintercept = 1, col = "red")+
  labs(title = "Individual variance")

# df_expwt %>% 
#   filter(id == 1) %>%
#   mutate(alpha = as.factor(alpha)) %>%
#   group_by(alpha) %>%
#   summarise(var = var(Y1), .groups = "keep") 
df_expwt %>% 
  filter(id %in% 1:5) %>%
  mutate(alpha=as.factor(alpha), subj_id = id) %>%
  group_by(alpha, subj_id) %>%
  group_modify(~{variogram(Y1~1, location = ~s1+s2, data = .x)}) %>%
  ggplot()+
  geom_line(aes(x=dist, y=gamma, col=alpha))+
  geom_point(aes(x=dist, y=gamma, col=alpha))+
  geom_hline(yintercept = 1, linetype = "dashed")+
  facet_wrap(~subj_id, nrow = 1)+
  labs(y="Sample variagram for five subjects")

How come when \(\alpha=1\), the semivariogram can go over 1? This weight introduced greater variation (noise) to the data? But it does not look that way in the individual variance plot. The others ended up flatten out at 1.

1.5 Effect of functional shape of weight

The variograms from Amy took on many wild shapes! It seems that semivariance structure varies greatly across lesions, though the “peak” shape is pretty common. I am not sure how I’d be able to generate all these different spatial correlations. Intuitively, I’d like to start with different types of weights, with different functional shape wrt to Euclidean distance.

Let’s fix the filter size to be 5 by 5, and then create some different types of filters

  • A reverse U-shape: \(w_{ij} \propto (d_{ij}-1.5)^2\)
  • An decreasing function: \(w_{ij} \propto exp(-d_{ij})\). The \(max\{d_{ij}\}\) term is to scale the value of weight down to a range that is comparable to the other functions.
  • An increasing function: \(w_{ij} \propto exp(d_{ij}-max\{d_{ij}\})\)
  • A periodic function: \(w_{ij} \propto sin(2\pi d_{ij}/max(d_{ij}))\)
  • Another periodic function: \(w_{ij} \propto cos(2\pi d_{ij}/max(d_{ij}))\)

Note that the trigonometric functions have negative weight.

Here I can’t really scaled the weights as before. Since for the trigonometric functions, the sum of weights is likely to be very close to zero, causing the magnitude of scaled weight to be inflated dramatically. The generated data, then, will not have a comparable variation for sure. For example, the sum of cosine weights would be -0.47, and the scaled weight will have a range of (-53.12, 53.12).

wt_type %>%
  pivot_longer(starts_with("wt_")) %>%
  mutate(name=factor(name, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot(aes(x=s1, y=s2, fill = value))+
  geom_tile()+
  facet_wrap(~name, nrow = 1)+
  labs(title = "Weight matrix")

wt_type %>%
  pivot_longer(starts_with("wt_")) %>%
  mutate(name=factor(name, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot(aes(x=dist, y= value))+
  geom_line()+
  geom_point()+
  facet_wrap(~name, nrow = 1)+
  labs(title = "Weight as a function of distance")

df_wt_type %>%
  filter(id==1) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot(aes(x=s1, y=s2, fill=Y1))+
  geom_tile()+
  facet_wrap(~wt_type, nrow = 1)+
  labs(title = "Example of generated data")

Simple variance of generated data:

df_wt_type %>% 
  group_by(wt_type, id) %>% 
  summarise(var_Y1=var(Y1), .groups = "keep") %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot(aes(x=wt_type, y = var_Y1, group=wt_type))+
  geom_boxplot()+
  geom_hline(yintercept = 1, col = "red")+
  geom_jitter(size = 0.2)

Based on previous sections, I think dec, inc and u shape induced stronger spatial correlation.

df_wt_type %>%
  filter(id<=5) %>% 
  mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos")),
         subj_id = id) %>%
  group_by(subj_id, wt_type) %>%
  group_modify(~{variogram(Y1 ~ 1, locations = ~s1+s2, data = .x)}) %>%
  ggplot(aes(x=dist, y=gamma, col=wt_type))+
  geom_line()+
  geom_point()+
  facet_wrap(~subj_id, nrow = 1)+
  labs(y="Semivariance")

Some of my interpretation:

As distance increases, the spatial correlation eventually fades, so the curves stayed flat at about the baseline variation level. The major difference lies in the shorter distance part.

  • Trigonomitric functions have weight that bounces around zero, so there isn’t much change over distance. In fact, the flunctuation of sin and cos functions seems to be opposite of each other.
  • Decreasing function put more weight on neighbors, so the curve started lower than increasing function.
  • Increasing function put more weight on edges, so the curve drop below the decreasing weight at some point.

1.6 Some notes

  1. Semivariograms should really be increasing. If it is not, it is probably a result for noise or heterogeneity (variance change across location).
  • But alternatively, if we only interested in local correlation, we wouldn’t need semivariograms! We only need the estimates filter weights for that.
  1. If the goal is only to simulate data that looks like the real data, it might be better to use standard image simulation model and learn the weight matrix
  2. Block bootstrap: Hierarchical clustering

2 Bivariate correlation testing under null hypothesis

Follow up on last time, we would like to generate data from the null hypothesis where \(Y_1\), \(Y_2\) are not correlated with each other. We will also remove any fixed/random effect, leaving only the moving average error, in case any unexpected correlation is induced

We generate data from the following scheme:

\[\begin{aligned} Y_{1i}(s_1, s_2, t) &=\epsilon_{1i}(s_1, s_2, t) \\ Y_{2i}(s_1, s_2, t) &=\epsilon_{2i}(s_1, s_2, t) \\ \end{aligned}\]

  • \((s_1, s_2)\) are spatial coordinates on a 32 by 32 2D image
  • \(\epsilon_{i1}\), \(\epsilon_{i2}\) are moving average of a white noise field.
  • In fact, in this case there is no point generating across time, because all time points have the same distribution. It would just be like repeating the same thing a few more times. That leaves us room for exploring other factors, such as filter size and weight.

2.1 Effect of filter size

df_ksize %>% 
  filter(id==15) %>%
  pivot_longer(starts_with("Y")) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill = value))+
  facet_grid(cols=vars(ksize), rows = vars(name))+
  labs(title = "Generated data of subject ID = 15")

df_ksize %>% 
  filter(id %in% 1:4) %>%
  ggplot(aes(x=Y1, y=Y2))+
  geom_point(size = 0.2)+
  geom_smooth(formula = 'y~x', method = "lm")+
  stat_cor(method = "pearson")+
  facet_grid(rows = vars(id), cols = vars(ksize))+
  labs(title = "Perason correlation of four subject")

When greater spatial correlation is induced by larger filter, more significant correlation was (falsely) detected.

2.1.1 Simple linear regression

We fit a simple linear model across space conditioning on each subject and time:

\[Y_{2i}(\mathbf{s}|t) = \beta_{it0}+\beta_{it1}Y_{1i}(\mathbf{s}|t)+\epsilon_i(\mathbf{s}|t)\]

lm_err <- data.frame(ksize = unique(df_ksize$ksize))

for(n in seq_along(N_vec)){
  
  err_n <- df_ksize %>%
    filter(id <= N_vec[n]) %>%
    group_by(id, ksize) %>%
    group_modify(~{
      fit_lm <- lm(Y2~Y1, data = .x)
      data.frame(beta = coef(fit_lm)["Y1"], 
                 pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
    }) %>%
    mutate(reject = pval < 0.05) %>%
    group_by(ksize) %>%
    summarise_at("reject", mean) 
  
  lm_err[, paste0("N", N_vec[n])] <- err_n$reject
  
}

lm_err %>%
  rename("Filter size" = "ksize") %>%
  kable(table.attr = "style=\"color:black;\"") %>%
  kable_styling(full_width = F) %>%
  add_header_above(c(" " = 1, "Type I error" = "4"))
Type I error
Filter size N100 N200 N500 N1000
1 0.07 0.060 0.058 0.056
3 0.42 0.365 0.336 0.331
5 0.53 0.545 0.554 0.564
7 0.64 0.670 0.692 0.674
9 0.73 0.755 0.756 0.743

After increasing the sample size to 1000, the k=1 case had 4.4% type I error, still slightly lower than the nominal value of 5%. Does it indicate that data generated under this scheme is more likely to have false rejection under simple linear regression? Would it be because of the spatial correlation, such as the spatial correlation introduced some kind of correlation between \(Y_1\) and \(Y_2\)?

2.1.2 Block boostrap

We hope to bootstrap non-overlapping blocks of pixels to preserve some degree of correlation. While the image may not be evenly divided by the block size, we try to make the expectation of image size constant, so that each block will be sampled with equal probability. For the re-sampled image, regardless of its size, we will fit a simple linear regression model for each subject and examine the significance of the slope.

  • Let’s only use square blocks for now.
  • Let’s also fix the filter size at 5 for now.

PS. Originally, we wanted to set the resampling probability to be propotional to image size. But in fact, I think this approach couldn’t make the expectation of the pixels as the same as the original image. It would keep the expected size of the sampled image unchanged if all blocks are sampled with equal probability.

Let’s say the original image has N pixels and B blocks, and we want to resample a new image with also B blocks:

\[E(N_{new}) = E(\sum_{b=1}^Bn_b^{new}) = \sum_{b=1}^BE(n_b^{new})=\sum_{b=1}^B\sum_{b=1}^Bn_bp_b = \sum_{b=1}^B\sum_{b=1}^B n_b\frac{n_b}{N}=B\frac{\sum_{b=1}^Bn_B^2}{N} \] If we set \(p_b = 1/B\), then

\[E(N_{new}) = E(\sum_{b=1}^Bn_b^{new}) = \sum_{b=1}^BE(n_b^{new})=\sum_{b=1}^B\sum_{b=1}^Bn_bp_b = \sum_{b=1}^B\sum_{b=1}^B n_b\frac{1}{B}=\sum_{b=1}^B\frac{N}{B} = N \]

load(here("Data/block_boot_H0_ksize.RData"))
ksize_vec <- unique(df_ksize$ksize)

# calculate Type I error
lqt_mat <- apply(slope_est_ksize[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat <- apply(slope_est_ksize[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
reject_mat <- 0 < lqt_mat | uqt_mat < 0
typeIerror <- apply(reject_mat, 1:2, mean)
typeIerror <- data.frame(ksize = ksize_vec, typeIerror)
typeIerror %>% pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  ggplot()+
  geom_line(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
  geom_point(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
  geom_hline(yintercept = 0.05, col = "black", linetype = "dashed")+
  labs(x="Block size", y="Type I error", col = "Filter size")+
  scale_x_continuous(breaks = 1:max_size)

After scaling, the endpoint of type I error is not as different as before across filter sizes. But the trend stays the same. Larger filter size led to higher type I error.

Below I show the mean and standard deviation of bootstrap slope estimates for all 100 subjects (each point represents a subject).

# mean slope estimates
slope_mean1 <- apply(slope_est_ksize, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean1, perm = c(1, 3, 2)), 
      dim =c(N*length(ksize_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(ksize_vec)), 
         ksize = rep(ksize_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~ksize, ncol = 1)+
  labs(x="Block size", title="Bootstrap mean of slope estimates")+
  scale_x_continuous(breaks = 1:max_size)+
  geom_hline(yintercept = 0, col="red", linetype = "dashed")

# mean slope estimates
slope_sd1 <- apply(slope_est_ksize, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd1, perm = c(1, 3, 2)), 
      dim =c(N*length(ksize_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(ksize_vec)), 
         ksize = rep(ksize_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~ksize, ncol = 1)+
  labs(x="Block size", title="Standard deviation of bootstrap estimates")+
  scale_x_continuous(breaks = 1:max_size)

Below I also showed an example of the bootstrap slope estimates of one subject.

array(slope_est_ksize[,,15,], dim = c(length(ksize_vec)*max_size, M)) %>% 
  data.frame() %>%
  mutate(ksize = rep(ksize_vec, max_size), 
         bsize = rep(1:max_size, each = length(ksize_vec)),
         .before = 1) %>% 
  pivot_longer(starts_with("X")) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group=bsize))+
  geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
  facet_wrap(~ksize, ncol=1)+
  labs(x="Block size", title = "Bootstrap slope estiamtes of subject 15")+
  geom_hline(yintercept = 0, col="red", linetype = "dashed")

2.2 Effect of weight matrix

Here the two parameters of interest are the weight matrix (in data generation process) and the block size (in bootstrap process). I will use the same weight matrix as Section 1.2, and the same block size as Section 2.1.

Just like before, I will generate data for 1000 subject, but only use 100 for the block bootstrap portion. The filter size for generating data is fixed at 5.

df_expwt %>% 
  filter(id==15) %>%
  mutate(alpha=as.factor(alpha)) %>%
  pivot_longer(starts_with("Y")) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill = value))+
  facet_grid(cols=vars(alpha), rows = vars(name))+
  labs(title = "Generated data of subject ID = 15")

df_expwt %>% 
  filter(id %in% 1:4) %>%
  mutate(alpha=as.factor(alpha)) %>%
  ggplot(aes(x=Y1, y=Y2))+
  geom_point(size = 0.2)+
  geom_smooth(formula = 'y~x', method = "lm")+
  stat_cor(method = "pearson")+
  facet_grid(rows = vars(id), cols = vars(alpha))+
  labs(title = "Perason correlation of four subject")

2.2.1 Simple linear regression

lm_err2 <- data.frame(alpha = alpha_vec)
# N_vec <- c(100, 200, 500, 1000)

for(n in seq_along(N_vec)){
  
  err_n <- df_expwt %>%
    filter(id <= N_vec[n]) %>%
    group_by(id, alpha) %>%
    group_modify(~{
      fit_lm <- lm(Y2~Y1, data = .x)
      data.frame(beta = coef(fit_lm)["Y1"], 
                 pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
    }) %>%
    mutate(reject = pval < 0.05) %>%
    group_by(alpha) %>%
    summarise_at("reject", mean) 
  
  lm_err2[, paste0("N", N_vec[n])] <- err_n$reject
  
}

lm_err2 %>%
  # rename("Alpha" = "ksize") %>%
  # arrange(desc(alpha)) %>%
  mutate(alpha=fct_rev(as.factor(alpha))) %>%
  kable(table.attr = "style=\"color:black;\"") %>%
  kable_styling(full_width = F) %>%
  add_header_above(c(" " = 1, "Type I error" = "4"))
Type I error
alpha N100 N200 N500 N1000
0 0.52 0.545 0.526 0.531
0.5 0.35 0.355 0.418 0.438
1 0.34 0.310 0.272 0.256
2.5 0.13 0.120 0.106 0.083
5 0.10 0.075 0.066 0.057

2.2.2 Block bootstrap

load(here("Data/block_boot_H0_expwt.RData"))

# calculate Type I error
lqt_mat2 <- apply(slope_est_expwt[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat2 <- apply(slope_est_expwt[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
reject_mat2 <- 0 < lqt_mat2 | uqt_mat2 < 0
typeIerror2 <- apply(reject_mat2, 1:2, mean)
typeIerror2 <- data.frame(alpha = alpha_vec, typeIerror2)
typeIerror2 %>% pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(alpha=fct_rev(as.factor(alpha))) %>%
  ggplot()+
  geom_line(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
  geom_point(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
  geom_hline(yintercept = 0.05, col = "black", linetype = "dashed")+
  labs(x="Block size", y="Type I error", col = "Weight matrix")+
  scale_x_continuous(breaks = 1:max_size)

# mean slope estimates
slope_mean2 <- apply(slope_est_expwt, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean2, perm = c(1, 3, 2)), 
      dim =c(N*length(alpha_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(alpha_vec)), 
         alpha = rep(alpha_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(alpha=fct_rev(as.factor(alpha))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~alpha, ncol = 1)+
  labs(x="Block size", title="Bootstrap mean of slope estimates")+
  scale_x_continuous(breaks = 1:max_size)+
  geom_hline(yintercept = 0, col="red", linetype = "dashed")

# mean slope estimates
slope_sd2 <- apply(slope_est_expwt, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd2, perm = c(1, 3, 2)), 
      dim =c(N*length(alpha_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(alpha_vec)), 
         alpha = rep(alpha_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(alpha=fct_rev(as.factor(alpha))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~alpha, ncol = 1)+
  labs(x="Block size", title="Standard deviation of bootstrap estimates")+
  scale_x_continuous(breaks = 1:max_size)

array(slope_est_expwt[,,15,], dim = c(length(alpha_vec)*max_size, M)) %>% 
  data.frame() %>%
  mutate(alpha = rep(alpha_vec, max_size), 
         bsize = rep(1:max_size, each = length(alpha_vec)),
         .before = 1) %>% 
  pivot_longer(starts_with("X")) %>%
  mutate(alpha=fct_rev(as.factor(alpha))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group=bsize))+
  geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
  facet_wrap(~alpha, ncol=1)+
  labs(x="Block size", title = "Bootstrap slope estiamtes of subject 15")+
  geom_hline(yintercept = 0, col="red", linetype = "dashed")

2.3 Effect of shape of weight function

In this section, filter size is fixed at 5. Data is generated from different functional shape of weight function (wrt) distance

df_wt_type %>% 
  filter(id==15) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  pivot_longer(starts_with("Y")) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill = value))+
  facet_grid(cols=vars(wt_type), rows = vars(name))+
  labs(title = "Generated data of subject ID = 15")

df_wt_type %>% 
  filter(id %in% 1:4) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  ggplot(aes(x=Y1, y=Y2))+
  geom_point(size = 0.2)+
  geom_smooth(formula = 'y~x', method = "lm")+
  stat_cor(method = "pearson")+
  facet_grid(rows = vars(id), cols = vars(wt_type))+
  labs(title = "Perason correlation of four subject")

2.3.1 Simple linear regression

lm_err3 <- data.frame(wt_type=unique(df_wt_type$wt_type))
# N_vec <- c(100, 200, 500, 1000)

for(n in seq_along(N_vec)){
  
  err_n <- df_wt_type %>%
    filter(id <= N_vec[n]) %>%
    group_by(id, wt_type) %>%
    group_modify(~{
      fit_lm <- lm(Y2~Y1, data = .x)
      data.frame(beta = coef(fit_lm)["Y1"], 
                 pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
    }) %>%
    mutate(reject = pval < 0.05) %>%
    group_by(wt_type) %>%
    summarise_at("reject", mean) 
  
  lm_err3[, paste0("N", N_vec[n])] <- err_n$reject
  
}
lm_err3 %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  kable(table.attr = "style=\"color:black;\"") %>%
  kable_styling(full_width = F) %>%
  add_header_above(c(" " = 1, "Type I error" = "4"))
Type I error
wt_type N100 N200 N500 N1000
wt_inc 0.10 0.145 0.162 0.165
wt_dec 0.45 0.455 0.418 0.404
wt_u 0.34 0.415 0.394 0.394
wt_sin 0.12 0.120 0.138 0.126
wt_cos 0.31 0.285 0.276 0.266

2.3.2 Block bootstrap

load(here("Data/block_boot_H0_wt_type.RData"))

# calculate Type I error
lqt_mat3 <- apply(slope_est_wt_type[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat3 <- apply(slope_est_wt_type[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
reject_mat3 <- 0 < lqt_mat3 | uqt_mat3 < 0
typeIerror3 <- apply(reject_mat3, 1:2, mean)
typeIerror3 <- data.frame(wt_type = wt_type_vec, typeIerror3)
typeIerror3 %>% pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  ggplot()+
  geom_line(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
  geom_point(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
  geom_hline(yintercept = 0.05, col = "black", linetype = "dashed")+
  labs(x="Block size", y="Type I error", col = "Weight matrix")+
  scale_x_continuous(breaks = 1:max_size)

# mean slope estimates
slope_mean3 <- apply(slope_est_wt_type, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean3, perm = c(1, 3, 2)), 
      dim =c(N*length(wt_type_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(wt_type_vec)), 
         wt_type = rep(wt_type_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~wt_type, ncol = 1)+
  labs(x="Block size", title="Bootstrap mean of slope estimates")+
  scale_x_continuous(breaks = 1:max_size)+
  geom_hline(yintercept = 0, col="red", linetype = "dashed")

# mean slope estimates
slope_sd3 <- apply(slope_est_wt_type, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd3, perm = c(1, 3, 2)), 
      dim =c(N*length(wt_type_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(wt_type_vec)), 
         wt_type = rep(wt_type_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~wt_type, ncol = 1)+
  labs(x="Block size", title="Standard deviation of bootstrap estimates")+
  scale_x_continuous(breaks = 1:max_size)

array(slope_est_wt_type[,,15,], dim = c(length(wt_type_vec)*max_size, M)) %>% 
  data.frame() %>%
  mutate(wt_type = rep(wt_type_vec, max_size), 
         bsize = rep(1:max_size, each = length(wt_type_vec)),
         .before = 1) %>% 
  pivot_longer(starts_with("X")) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc","wt_dec","wt_u","wt_sin","wt_cos"))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group=bsize))+
  geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
  facet_wrap(~wt_type, ncol=1)+
  labs(x="Block size", title = "Bootstrap slope estiamtes of subject 15")+
  geom_hline(yintercept = 0, col="red", linetype = "dashed")

  • All the curves seem to decrease and then flatten as the block size increases.
  • Trigonometric functions had relatively flat curves, possible because the change of spatial correlation across distance is flatter.
  • Decreasing function follows the regular shape, but did not drop as low as the previous case. I think this is becasue exp(-d) posed less weight on the center pixel than exp(-d^2).
  • The increasing function and the U shape fuction both had a small pike in the middle.

3 Bivariate correlation testing under alternative hypothesis

In this section, we wish to generate two spatial-dependent variables that are correlated with each other. Let’s say our generation is also based only on moving average of white noise field, we update the data generation scheme as follows:

\[\begin{aligned} Y_{1i}(s_1, s_2, t) &=\epsilon_{1i}(s_1, s_2, t) \\ Y_{2i}(s_1, s_2, t) &=\beta_0+\beta_1 Y_{1i}(s_1, s_2, t) + \epsilon_{2i}(s_1, s_2, t) \\ \end{aligned}\]

  • \((s_1, s_2)\) are spatial coordinates on a 32 by 32 2D image
  • \(\epsilon_{i1}\), \(\epsilon_{i2}\) are moving average of a white noise field.
  • \(\beta_0=0\), \(\beta_1=1\)
  • For now we do not consider individual level random-effect. The coefficients are shared across population
load(here("Data/sim_H1_ksize.RData"))
# ksize_vec <- unique(df_ksize_h1$ksize)
load(here("Data/sim_H1_expwt.RData"))
load(here("Data/sim_H1_wt_type.RData"))

3.1 Filter size

df_ksize_h1 %>% 
  filter(id==15) %>%
  pivot_longer(starts_with("Y")) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill = value))+
  facet_grid(cols=vars(ksize), rows = vars(name))+
  labs(title = "Generated data of subject ID = 15")
df_ksize_h1 %>% 
  filter(id %in% 1:4) %>%
  ggplot(aes(x=Y1, y=Y2))+
  geom_point(size = 0.2)+
  geom_smooth(formula = 'y~x', method = "lm")+
  stat_cor(method = "pearson")+
  facet_grid(rows = vars(id), cols = vars(ksize))+
  labs(title = "Perason correlation of four subject")

Since data is generated under the alternative hypothesis, we calculate power (true rejection rate) in this case.

3.1.1 Simple linear regression

It is expected that LM would end up with all rejection, considering the spatial correlation

lm_err1_h1 <- data.frame(ksize = ksize_vec)

for(n in seq_along(N_vec)){
  
  err_n <- df_ksize_h1 %>%
    filter(id <= N_vec[n]) %>%
    group_by(id, ksize) %>%
    group_modify(~{
      fit_lm <- lm(Y2~Y1, data = .x)
      data.frame(beta = coef(fit_lm)["Y1"], 
                 pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
    }) %>%
    mutate(reject = pval < 0.05) %>%
    group_by(ksize) %>%
    summarise_at("reject", mean) 
  
  lm_err1_h1[, paste0("N", N_vec[n])] <- err_n$reject
  
}

lm_err1_h1 %>%
  rename("Filter size" = "ksize") %>%
  kable(table.attr = "style=\"color:black;\"") %>%
  kable_styling(full_width = F) %>%
  add_header_above(c(" " = 1, "Power" = "4"))
Power
Filter size N100 N200 N500 N1000
1 1 1 1 1
3 1 1 1 1
5 1 1 1 1
7 1 1 1 1
9 1 1 1 1

3.1.2 Block bootstrap

load(here("Data/block_boot_H1_ksize.RData"))
# calculate Type I error
lqt_mat1_h1 <- apply(slope_est_ksize_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat1_h1 <- apply(slope_est_ksize_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
cover_mat1_h1 <- lqt_mat1_h1 < 1 & 1 < uqt_mat1_h1 
cover_rate1_h1 <- apply(cover_mat1_h1, 1:2, mean)
cover_rate1_h1 <- data.frame(ksize = ksize_vec, cover_rate1_h1)
cover_rate1_h1 %>% pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  ggplot()+
  geom_line(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
  geom_point(aes(x=bsize, y=value, group=as.factor(ksize), col=as.factor(ksize)))+
  geom_hline(yintercept = 0.95, col = "black", linetype = "dashed")+
  labs(x="Block size", y="Coverate rate", col = "Filter size")+
  scale_x_continuous(breaks = 1:max_size)

Below I show the mean and standard deviation of bootstrap slope estimates for all 100 subjects (each point represents a subject).

# mean slope estimates
slope_mean1_h1 <- apply(slope_est_ksize_h1, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean1_h1, perm = c(1, 3, 2)), 
      dim =c(N*length(ksize_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(ksize_vec)), 
         ksize = rep(ksize_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~ksize, ncol = 1)+
  labs(x="Block size", title="Bootstrap mean of slope estimates")+
  scale_x_continuous(breaks = 1:max_size)+
  geom_hline(yintercept = 1, col="red", linetype = "dashed")

# mean slope estimates
slope_sd1_h1 <- apply(slope_est_ksize_h1, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd1_h1, perm = c(1, 3, 2)), 
      dim =c(N*length(ksize_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(ksize_vec)), 
         ksize = rep(ksize_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~ksize, ncol = 1)+
  labs(x="Block size", title="Standard deviation of bootstrap estimates")+
  scale_x_continuous(breaks = 1:max_size)

Below I also showed an example of the bootstrap slope estimates of one subject.

array(slope_est_ksize_h1[,,15,], dim = c(length(ksize_vec)*max_size, M)) %>% 
  data.frame() %>%
  mutate(ksize = rep(ksize_vec, max_size), 
         bsize = rep(1:max_size, each = length(ksize_vec)),
         .before = 1) %>% 
  pivot_longer(starts_with("X")) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group=bsize))+
  geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
  facet_wrap(~ksize, ncol=1)+
  labs(x="Block size", title = "Bootstrap slope estiamtes of subject 15")+
  geom_hline(yintercept = 1, col="red", linetype = "dashed")

3.2 Exponential weight

df_expwt_h1 %>% 
  filter(id==15) %>%
  # mutate(alpha=fct_rev(as.factor(alpha))) %>%
  pivot_longer(starts_with("Y")) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill = value))+
  facet_grid(cols=vars(alpha), rows = vars(name))+
  labs(title = "Generated data of subject ID = 15")

df_expwt_h1 %>% 
  # filter(id==1) %>%
  filter(id %in% 1:4) %>%
  # mutate(alpha=fct_rev(as.factor(alpha))) %>%
  ggplot(aes(x=Y1, y=Y2))+
  geom_point(size = 0.2)+
  geom_smooth(formula = 'y~x', method = "lm")+
  stat_cor(method = "pearson")+
  facet_grid(rows = vars(id), cols = vars(alpha))+
  labs(title = "Perason correlation of four subject")

3.2.1 Simple linear regression

lm_err2_h1 <- data.frame(alpha = alpha_vec)
# N_vec <- c(100, 200, 500, 1000)

for(n in seq_along(N_vec)){
  
  err_n <- df_expwt_h1 %>%
    filter(id <= N_vec[n]) %>%
    group_by(id, alpha) %>%
    group_modify(~{
      fit_lm <- lm(Y2~Y1, data = .x)
      data.frame(beta = coef(fit_lm)["Y1"], 
                 pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
    }) %>%
    mutate(reject = pval < 0.05) %>%
    group_by(alpha) %>%
    summarise_at("reject", mean) 
  
  lm_err2_h1[, paste0("N", N_vec[n])] <- err_n$reject
  
}

lm_err2_h1 %>%
  # rename("Alpha" = "ksize") %>%
  # arrange(desc(alpha)) %>%
  kable(table.attr = "style=\"color:black;\"") %>%
  kable_styling(full_width = F) %>%
  add_header_above(c(" " = 1, "Power" = "4"))
Power
alpha N100 N200 N500 N1000
0.0 1 1 1 1
0.5 1 1 1 1
1.0 1 1 1 1
2.5 1 1 1 1
5.0 1 1 1 1

3.2.2 Block bootstrap

We will follow the same set up as Section 2.2.

load(here("Data/block_boot_H1_expwt.RData"))
# calculate coverage rate
lqt_mat2_h1 <- apply(slope_est_expwt_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat2_h1 <- apply(slope_est_expwt_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
cover_mat2_h1 <- lqt_mat2_h1 < 1 & 1 < uqt_mat2_h1 
cover_rate2_h1 <- apply(cover_mat2_h1, 1:2, mean)
cover_rate2_h1 <- data.frame(alpha = alpha_vec, cover_rate2_h1)
cover_rate2_h1 %>% pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  # mutate(alpha=as.factor(alpha)) %>%
  ggplot()+
  geom_line(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
  geom_point(aes(x=bsize, y=value, group=as.factor(alpha), col=as.factor(alpha)))+
  geom_hline(yintercept = 0.95, col = "black", linetype = "dashed")+
  labs(x="Block size", y="Coverate rate", col = "Weight matrix")+
  scale_x_continuous(breaks = 1:max_size)

Below I show the mean and standard deviation of bootstrap slope estimates for all 100 subjects (each point represents a subject).

# mean slope estimates
slope_mean2_h1 <- apply(slope_est_expwt_h1, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean2_h1, perm = c(1, 3, 2)), 
      dim =c(N*length(alpha_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(alpha_vec)), 
         alpha = rep(alpha_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(alpha=as.factor(alpha)) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~alpha, ncol = 1)+
  labs(x="Block size", title="Bootstrap mean of slope estimates")+
  scale_x_continuous(breaks = 1:max_size)+
  geom_hline(yintercept = 1, col="red", linetype = "dashed")

New discovery: first row seems slightly biased! Does strong spatial correlation also affects the mean of bootstrap estimation?

# mean slope estimates
slope_sd2_h1 <- apply(slope_est_expwt_h1, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd2_h1, perm = c(1, 3, 2)), 
      dim =c(N*length(alpha_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(alpha_vec)), 
         alpha = rep(alpha_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(alpha=as.factor(alpha)) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~alpha, ncol = 1)+
  labs(x="Block size", title="Standard deviation of bootstrap estimates")+
  scale_x_continuous(breaks = 1:max_size)

Below I also showed an example of the bootstrap slope estimates of one subject.

array(slope_est_expwt_h1[,,15,], dim = c(length(alpha_vec)*max_size, M)) %>% 
  data.frame() %>%
  mutate(alpha = rep(alpha_vec, max_size), 
         bsize = rep(1:max_size, each = length(alpha_vec)),
         .before = 1) %>% 
  pivot_longer(starts_with("X")) %>%
  mutate(alpha=as.factor(alpha)) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group=bsize))+
  geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
  facet_wrap(~alpha, ncol=1)+
  labs(x="Block size", title = "Bootstrap slope estiamtes of subject 15")+
  geom_hline(yintercept = 1, col="red", linetype = "dashed")

3.3 Functional shape of weight

df_wt_type_h1 %>% 
  filter(id==15) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  pivot_longer(starts_with("Y")) %>%
  ggplot()+
  geom_tile(aes(x=s1, y=s2, fill = value))+
  facet_grid(cols=vars(wt_type), rows = vars(name))+
  labs(title = "Generated data of subject ID = 15")

df_wt_type_h1 %>% 
  # filter(id==1) %>%
  filter(id %in% 1:4) %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot(aes(x=Y1, y=Y2))+
  geom_point(size = 0.2)+
  geom_smooth(formula = 'y~x', method = "lm")+
  stat_cor(method = "pearson")+
  facet_grid(rows = vars(id), cols = vars(wt_type))+
  labs(title = "Perason correlation of four subject")

3.3.1 Simple linear regression

lm_err3_h1 <- data.frame(wt_type = wt_type_vec)
# N_vec <- c(100, 200, 500, 1000)

for(n in seq_along(N_vec)){
  
  err_n <- df_wt_type_h1 %>%
    filter(id <= N_vec[n]) %>%
    group_by(id, wt_type) %>%
    group_modify(~{
      fit_lm <- lm(Y2~Y1, data = .x)
      data.frame(beta = coef(fit_lm)["Y1"], 
                 pval = summary(fit_lm)$coefficients["Y1", "Pr(>|t|)"])
    }) %>%
    mutate(reject = pval < 0.05) %>%
    group_by(wt_type) %>%
    summarise_at("reject", mean) 
  
  lm_err3_h1[, paste0("N", N_vec[n])] <- err_n$reject
  
}

lm_err3_h1 %>%
  mutate(wt_type = factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  kable(table.attr = "style=\"color:black;\"") %>%
  kable_styling(full_width = F) %>%
  add_header_above(c(" " = 1, "Power" = "4"))
Power
wt_type N100 N200 N500 N1000
wt_inc 1 1 1 1
wt_dec 1 1 1 1
wt_u 1 1 1 1
wt_sin 1 1 1 1
wt_cos 1 1 1 1

3.3.2 Block bootstrap

load(here("Data/block_boot_H1_wt_type.RData"))
# calculate Type I error
lqt_mat3_h1 <- apply(slope_est_wttype_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.025)
uqt_mat3_h1 <- apply(slope_est_wttype_h1[ , , , ], MARGIN = 1:3, FUN =quantile, probs = 0.975)
cover_mat3_h1 <- lqt_mat3_h1 < 1 & 1 < uqt_mat3_h1 
cover_rate3_h1 <- apply(cover_mat3_h1, 1:2, mean)
cover_rate3_h1 <- data.frame(wt_type = wt_type_vec, cover_rate3_h1)
cover_rate3_h1 %>% pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot()+
  geom_line(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
  geom_point(aes(x=bsize, y=value, group=wt_type, col=wt_type))+
  geom_hline(yintercept = 0.95, col = "black", linetype = "dashed")+
  labs(x="Block size", y="Coverate rate", col = "Weight matrix")+
  scale_x_continuous(breaks = 1:max_size)

# mean slope estimates
slope_mean3_h1 <- apply(slope_est_wttype_h1, MARGIN = 1:3, FUN = mean)
array(aperm(slope_mean3_h1, perm = c(1, 3, 2)), 
      dim =c(N*length(wt_type_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(wt_type_vec)), 
         wt_type = rep(wt_type_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~wt_type, ncol = 1)+
  labs(x="Block size", title="Bootstrap mean of slope estimates")+
  scale_x_continuous(breaks = 1:max_size)+
  geom_hline(yintercept = 1, col="red", linetype = "dashed")

We also seem some bias in the first row (increasing weight)

# mean slope estimates
slope_sd3_h1 <- apply(slope_est_wttype_h1, MARGIN = 1:3, FUN = sd)
array(aperm(slope_sd3_h1, perm = c(1, 3, 2)), 
      dim =c(N*length(wt_type_vec), max_size)) %>% 
  data.frame() %>%
  mutate(id = rep(1:N, each = length(wt_type_vec)), 
         wt_type = rep(wt_type_vec, N),
         .before=1) %>%
  pivot_longer(starts_with("X")) %>%
  mutate(bsize = as.numeric(gsub("X", "", name))) %>%
  mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group = bsize))+
  geom_jitter(aes(x=bsize, y=value), size = 0.2)+
  facet_wrap(~wt_type, ncol = 1)+
  labs(x="Block size", title="Standard deviation of bootstrap estimates")+
  scale_x_continuous(breaks = 1:max_size)

Below I also showed an example of the bootstrap slope estimates of one subject.

array(slope_est_wttype_h1[,,15,], dim = c(length(wt_type_vec)*max_size, M)) %>% 
  data.frame() %>%
  mutate(wt_type = rep(wt_type_vec, max_size), 
         bsize = rep(1:max_size, each = length(wt_type_vec)),
         .before = 1) %>% 
  pivot_longer(starts_with("X")) %>%
  mutate(wt_type=factor(wt_type, levels = c("wt_inc", "wt_dec", "wt_u", "wt_sin", "wt_cos"))) %>%
  ggplot()+
  geom_boxplot(aes(x=bsize, y=value, group=bsize))+
  geom_jitter(aes(x=bsize, y=value, group=bsize), size=0.2)+
  facet_wrap(~wt_type, ncol=1)+
  labs(x="Block size", title = "Bootstrap slope estiamtes of subject 15")+
  geom_hline(yintercept = 1, col="red", linetype = "dashed")

4 Introduced bivariate correlation through spatial intercorrelation

Why couldn’t the block bootstrap approach reach nominal type I error/coverage rate? One speculation is that through generating the two outcomes separately from the same mechanism to introduce spatial intercorrelation, we introduced correlation between the two variables.

To explore that, I wanna look into the residuals after subtracting the effect of \(Y_1\) on \(Y_2\), in the original image and the bootstrap samples.

Let’s start with one subject (subject 15). I will do the block bootstrap with different block size, and visualize the residual inter-correlation. This turned out to be quite a challenging problem, because to plot the semivariogram, we need each pixel to have a location index. And it is not straight forward to do that, putting the sampled blocks into the original 2D format.

My way is to create a 2D matrix that I am sure that can fit all the blocks, and put the blocks one by one into their slot, leaving some empty pixels here and there.

4.1 Filter size

# subset subject 15
df_id15 <- df_ksize %>% filter(id == 15)

# # simple linear regression
# df_id15 %>% 
#   group_by(ksize) %>%
#   group_modify(~{this_fit <- lm(Y2~Y1, data = .x)
#                  data.frame(res = this_fit$residuals,
#                             pred = this_fit$fitted.values)}) %>%
#   ggplot(aes(x=pred, y=res))+
#   geom_point(size = 0.2)+
#   geom_smooth(formula = y~x)+
#   facet_wrap(~ksize)
# 
# try_fit <- lm(Y2~Y1, data = df_id15)
# try_fit$fitted.values
# 
# lapply(lm_list, function(x){x$residuals}) %>%
#   bind_rows()

Out of curiosity, I would like to look at the distribution of residuals in the presense of spatial correlation

# containers
res_list <- list()
for(k in seq_along(ksize_vec)){ # filter size for data generation
  
  res_list[[k]] <- list()
  
  # A matrix of observed outcome
  this_df <- df_id15 %>% filter(ksize==ksize_vec[k])
  this_df[, paste0("res_k", 1:max_size)] <- NA
  Y_mat <- matrix(this_df$Y2, nS, nS)
  
  
  for(b in 1:max_size){ # block size for bootstrap
    
      
      # divide the matrix into blocks
      rblock <- (row(Y_mat)-1)%/%b+1
      cblock <- (col(Y_mat)-1)%/%b+1
      block_id_mat <- (rblock-1)*max(cblock) + cblock
      nblock <- max(block_id_mat)
      
      # sample blocks
      # sample the same number of blocks as the original image
      this_df$block_id <- as.vector(block_id_mat)
      block_list <- split(this_df, f = this_df$block_id)
      
      # for(m in 1:M){
      boot_block <- sample(1:nblock, size = nblock, replace = T)
      boot_list <- block_list[boot_block]
      names(boot_list) <- as.character(1:nblock)
      boot_df <- bind_rows(boot_list, .id="new_block_id")
      
      # fit model
      boot_lm <-  lm(Y2~Y1, data = boot_df)
      # slope_est5[k, b] <- boot_lm$coefficients["Y1"]
      boot_df$res <- boot_lm$residuals
      new_boot_list <- split(boot_df, f=boot_df$new_block_id)
      
      # put residuals back to its original 2D format
      res_mat <- matrix(NA, nrow = sqrt(nblock)*b, ncol = sqrt(nblock)*b)
      rblock2 <- (row(res_mat)-1)%/%b+1
      cblock2 <- (col(res_mat)-1)%/%b+1
      block_id_mat2 <- (rblock2-1)*max(cblock2) + cblock2
      res_df <- data.frame(block_id_mat2) %>% mutate(s1 = 1:(sqrt(nblock)*b), .before=1) %>%
        pivot_longer(starts_with("X"), names_to = "s2", values_to = "block_id") %>%
        mutate(s2 = gsub("X", "", s2)) %>% 
        arrange(block_id, s2) %>% 
        mutate(res=NA)

      for(l in 1:nblock){
        res_df$res[res_df$block_id==l][1:nrow(new_boot_list[[l]])] <- new_boot_list[[l]]$res
      }
      
      res_list[[k]][[b]] <- res_df
      
  }
}
t2 <- Sys.time()
  • Please pay attention to the different scale of y-axis scale of the following figures.
# put the residuals back in the 2D image
for(k in seq_along(res_list)){
  res_plt <- res_list[[k]] %>%
    lapply(function(x){variogram(res~1, locations = ~s1+s2, data = x %>% filter(!is.na(res))
                                 %>% mutate_all(as.numeric))}) %>% 
    bind_rows(.id="block_size") %>%
    ggplot(aes(x=dist, y=gamma, col=as.factor(block_size)))+
    geom_line()+
    geom_point()+
    labs(title = paste0("Filter size = ", ksize_vec[k]),
         y = "Variogram", col = "Block size")
  print(res_plt)
  
}

From the variogram of residuals, for individual subjects, strong spatial correlation remained after block bootstrap, and the size of block did not make much difference. I am still thinking about ways to integrally showing that for multiple bootstrap iterations of multiple subjects.

Once we do a full boostrap, for one subject at one filter size and one block size, we’ll have 1000 semivariogram curves (that is, basically, for each colored curve above, we’ll have 1000 copies). Can we just simply plot them and smooth them?

5 References